The dataset is obtained from the 2016 NYC Yellow Cab trip record data made available in Big Query on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC). This report will be demostrating on what are the popular pickup spots of two different time frame (Morning, Evening) during a day.
The training dataset is downloaded and unzip from Kaggle website, we can see that there are totally 1.45 million rows and 11 columns.
# The URL to download the train.zip is from the kaggle website:
# website:https://www.kaggle.com/c/nyc-taxi-trip-duration/data/
unzip("train.zip")
trainDS<-c()
if (file.exists("train.bin"))
{
trainDS<-readRDS("train.bin")
}
if (file.exists("train.csv"))
{
trainDS<-read.table("train.csv", header=TRUE, sep=",")
saveRDS(trainDS, file="train.bin")
}
dim(trainDS)
## [1] 1458644 11
The columns available are shown below. We will be using the following columns (pickup_datetime, pickup_longitute, pickup_latitude) to demostrate the popular pickup location for the taxi ride during morning time (7 am to 9 am), and evening time (5 pm to 7 pm).
names(trainDS)
## [1] "id" "vendor_id" "pickup_datetime"
## [4] "dropoff_datetime" "passenger_count" "pickup_longitude"
## [7] "pickup_latitude" "dropoff_longitude" "dropoff_latitude"
## [10] "store_and_fwd_flag" "trip_duration"
We first preprocess the data to categorize the pickup time frame of taxi ride. A new column named (“pickup_timeframe”) is added to dataset. Hour between 7 am to 9 am is assgiend to “Morning”, while hour between 5 pm and 7 pm is assigned to “Evening”. We also change the pickup_longtitude and pickup_latitude to numeric for better processing later. For speed up the process, we will limit only the morning /evening data for the analysis later.
finalDS <- trainDS %>%
select(id, pickup_datetime, pickup_longitude, pickup_latitude) %>%
mutate(pt = as.POSIXct(strptime(pickup_datetime, format="%Y-%m-%d %H:%M:%S"))) %>%
mutate(pickup_timeframe=case_when(
hour(pt) >= 7 & hour(pt) <=9 ~ "morning",
hour(pt) >= 17 & hour(pt) <= 19 ~ "evening",
TRUE ~ "other")) %>%
mutate(pickup_timeframe=as.factor(pickup_timeframe)) %>%
mutate(pickup_longitude=as.numeric(as.character(pickup_longitude)))%>%
mutate(pickup_latitude=as.numeric(as.character(pickup_latitude))) %>%
filter(pickup_timeframe %in% c("morning", "evening"))
dim(finalDS)
## [1] 447707 6
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 2153024 115.0 6861544 366.5 6525716 348.6
## Vcells 5147223 39.3 64251174 490.2 60706561 463.2
To draw a map of popular pickup location for morning, we subset the data for morning ride to a new data frame named morningDS. We can see there are around 190K of morning taxi rides.
morningDS <- finalDS %>%
subset(pickup_timeframe=="morning") %>%
select(pickup_longitude, pickup_latitude, id)
dim(morningDS)
## [1] 190316 3
To show the most popular pickup location for the taxi ride, we will first run the kmeans to calculate the most popular location in the geospecial space. After sorting the data, we can see that the top 3 location are around these three longitude/latitude (-73.95214 40.77422), (-73.98949, 40.75722), and (-73.97089, 40.75831).
set.seed(7553)
kms <- kmeans(cbind(morningDS$pickup_longitude, morningDS$pickup_latitude), centers=10)
morningCluster <- cbind(morningDS, kms$cluster)
names(morningCluster) <- c("lng", "lat", "id", "cluster_id")
c <- morningCluster %>%
group_by(cluster_id) %>%
summarise_at(vars(id), n_distinct) %>%
mutate(lng=kms$centers[,1], lat=kms$centers[,2])
names(c) <- c("cluster_id", "count", "lng", "lat")
c <- c %>% arrange(desc(count))
print(c)
# A tibble: 10 x 4
cluster_id count lng lat
<int> <int> <dbl> <dbl>
1 7 34562 -73.95214 40.77422
2 2 33327 -73.98949 40.75722
3 10 32008 -73.97089 40.75831
4 4 24122 -73.98286 40.73742
5 3 20392 -73.99952 40.73765
6 9 17811 -73.97637 40.78438
7 6 12233 -74.00853 40.71438
8 8 7133 -73.83112 40.71606
9 1 5979 -73.95028 40.80955
10 5 2749 -73.98543 40.68444
We display the ride data in of the ride using leaflet, centered on the most popular location obtained from the previous method. we can clearly see the most popular pickup locations are in lower and middle Manhattan.
library(htmltools)
morningMap <- leaflet(c) %>%
addTiles() %>%
setView(lng=-73.95214, lat=40.77422, zoom=10) %>%
addCircleMarkers(lng=~lng, lat=~lat, popup= as.character(c$count),
popupOptions=popupOptions(riseOnHover=TRUE))
morningMap
The evening taxi ride data has 257K rows. We subset the data and prodices the eveningDS data frame.
eveningDS <- finalDS %>%
subset(pickup_timeframe=="evening") %>%
select(pickup_longitude, pickup_latitude, id)
dim(eveningDS)
## [1] 257391 3
Using kMeans method, we can add circle markers to the map to show the most popular locations.
kmsEvening <- kmeans(cbind(eveningDS$pickup_longitude, eveningDS$pickup_latitude), centers=10)
eveningCluster <- cbind(eveningDS, kmsEvening$cluster)
names(eveningCluster) <- c("lng", "lat", "id", "cluster_id")
c2 <- eveningCluster %>%
group_by(cluster_id) %>%
summarise_at(vars(id), n_distinct) %>%
mutate(lng=kmsEvening$centers[,1], lat=kmsEvening$centers[,2])
names(c2) <- c("cluster_id", "count", "lng", "lat")
c2 <- arrange(c2, desc(count))
c2
## # A tibble: 10 x 4
## cluster_id count lng lat
## <int> <int> <dbl> <dbl>
## 1 5 45013 -73.97128 40.75847
## 2 4 43066 -73.98671 40.74557
## 3 8 35371 -73.98578 40.76372
## 4 1 34183 -73.95428 40.77492
## 5 7 25090 -74.00164 40.73868
## 6 9 23263 -73.98984 40.72403
## 7 6 19208 -73.97481 40.78688
## 8 10 14588 -74.01278 40.70865
## 9 2 13029 -73.82948 40.71486
## 10 3 4580 -73.94917 40.81307
From the pickup ride count above, we can see the most popular locations are of these three: (-73.99234, 40.74488), (-73.97224, 40.75665), and (-73.98521, 40.76382). We centered our map in the top location.
eveningMap <- leaflet(c2) %>%
addTiles() %>%
setView(lng=-73.99234, lat=40.74488, zoom=10) %>%
addCircleMarkers(lng=~lng, lat=~lat,
popup=as.character(c2$count),
radius=~count/1000,
clusterOptions = markerClusterOptions())
eveningMap